The following is a simplified description of the variables:
limit_bal: The amount of the given credit (NT dollars)
sex: Gender
education: Level of education
marriage: Marital status
age: Age of the customer
payment_status_(month): Status of payment in one of the previous 6 months
bill_statement_(month): The amount of the bill statement (NT dollars) in one of the previous 6 months
previous_payment_(month): The amount of the previous payment (NT dollars) in one of the previous 6 months
The target variable ['default_payment_next_month'] indicates whether the customer defaulted on the payment in the following month.
1.1.1 Import the libraries:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import numpy as np
import random
import seaborn as sns
import plotly.express as px
import plotly.io as pio
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [8, 4.5]
plt.rcParams['figure.dpi'] = 300
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv('credit_card_default.csv', index_col=0, na_values='')
print(f'The DataFrame contains {len(df)} rows and {df.shape[1]} columns.')
df.head(5)
The DataFrame contains 30000 rows and 24 columns.
| | limit_bal | sex | education | marriage | age | payment_status_sep | payment_status_aug | payment_status_jul | payment_status_jun | payment_status_may | ... | bill_statement_jun | bill_statement_may | bill_statement_apr | previous_payment_sep | previous_payment_aug | previous_payment_jul | previous_payment_jun | previous_payment_may | previous_payment_apr | default_payment_next_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | Female | University | Married | 24.0 | Payment delayed 2 months | Payment delayed 2 months | Payed duly | Payed duly | Unknown | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 120000 | Female | University | Single | 26.0 | Payed duly | Payment delayed 2 months | Unknown | Unknown | Unknown | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 90000 | Female | University | Single | 34.0 | Unknown | Unknown | Unknown | Unknown | Unknown | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 50000 | Female | University | Married | 37.0 | Unknown | Unknown | Unknown | Unknown | Unknown | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 50000 | Male | University | Married | 57.0 | Payed duly | Unknown | Payed duly | Unknown | Unknown | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 24 columns
1.1.2 Separate the features from the target (y):
X = df.copy()
y = X.pop('default_payment_next_month') # separate our Features and our Target
1.1.3 Inspecting the data types:
df.dtypes # show the data Types of our Dataset
limit_bal                       int64
sex                            object
education                      object
marriage                       object
age                           float64
payment_status_sep             object
payment_status_aug             object
payment_status_jul             object
payment_status_jun             object
payment_status_may             object
payment_status_apr             object
bill_statement_sep              int64
bill_statement_aug              int64
bill_statement_jul              int64
bill_statement_jun              int64
bill_statement_may              int64
bill_statement_apr              int64
previous_payment_sep            int64
previous_payment_aug            int64
previous_payment_jul            int64
previous_payment_jun            int64
previous_payment_may            int64
previous_payment_apr            int64
default_payment_next_month      int64
dtype: object
1.1.4 Memory optimization:
def get_df_memory_usage(df, top_columns=5):
    # report the largest columns and the total memory footprint in MB
    print('Memory usage ----')
    memory_per_column = df.memory_usage(deep=True) / 1024 ** 2
    print(f'Top {top_columns} columns by memory (MB):')
    print(memory_per_column.sort_values(ascending=False) \
          .head(top_columns))
    print(f'Total size: {memory_per_column.sum():.4f} MB')
get_df_memory_usage(df, 5)
Memory usage ----
Top 5 columns by memory (MB):
education             1.965001
payment_status_sep    1.954342
payment_status_aug    1.920288
payment_status_jul    1.916343
payment_status_jun    1.904229
dtype: float64
Total size: 20.7012 MB
1.1.5 Convert object columns to categorical:
df_cat = df.copy()
object_columns = df_cat.select_dtypes(include= 'object').columns
df_cat[object_columns] = df_cat[object_columns].astype('category')
get_df_memory_usage(df_cat)
Memory usage ----
Top 5 columns by memory (MB):
Index                   0.228882
bill_statement_aug      0.228882
previous_payment_apr    0.228882
previous_payment_may    0.228882
previous_payment_jun    0.228882
dtype: float64
Total size: 3.9265 MB
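The drop from about 20.7 MB to about 3.9 MB follows from how pandas stores the two dtypes: an object column holds one Python string per row, while a category column holds each distinct label once plus a small integer code per row. A minimal sketch on synthetic data (the labels and the row count are made up for illustration, and the exact byte counts depend on the pandas version):

```python
import pandas as pd

# hypothetical toy column: 3 distinct labels repeated 10,000 times each
labels = ['University', 'High school', 'Others'] * 10_000
as_object = pd.Series(labels, dtype=object, name='education')
as_category = as_object.astype('category')

# deep=True also counts the memory of the Python string objects
obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)

print(f'object:   {obj_bytes / 1024 ** 2:.3f} MB')
print(f'category: {cat_bytes / 1024 ** 2:.3f} MB')
```

The saving is largest for columns with few distinct values relative to the number of rows, which is the case for all nine object columns here.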
1.2.1 Summary statistics for numerical variables:
df.describe().transpose().round(2)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| limit_bal | 30000.0 | 167484.32 | 129747.66 | 10000.0 | 50000.00 | 140000.0 | 240000.00 | 1000000.0 |
| age | 29850.0 | 35.49 | 9.22 | 21.0 | 28.00 | 34.0 | 41.00 | 79.0 |
| bill_statement_sep | 30000.0 | 51223.33 | 73635.86 | -165580.0 | 3558.75 | 22381.5 | 67091.00 | 964511.0 |
| bill_statement_aug | 30000.0 | 49179.08 | 71173.77 | -69777.0 | 2984.75 | 21200.0 | 64006.25 | 983931.0 |
| bill_statement_jul | 30000.0 | 47013.15 | 69349.39 | -157264.0 | 2666.25 | 20088.5 | 60164.75 | 1664089.0 |
| bill_statement_jun | 30000.0 | 43262.95 | 64332.86 | -170000.0 | 2326.75 | 19052.0 | 54506.00 | 891586.0 |
| bill_statement_may | 30000.0 | 40311.40 | 60797.16 | -81334.0 | 1763.00 | 18104.5 | 50190.50 | 927171.0 |
| bill_statement_apr | 30000.0 | 38871.76 | 59554.11 | -339603.0 | 1256.00 | 17071.0 | 49198.25 | 961664.0 |
| previous_payment_sep | 30000.0 | 5663.58 | 16563.28 | 0.0 | 1000.00 | 2100.0 | 5006.00 | 873552.0 |
| previous_payment_aug | 30000.0 | 5921.16 | 23040.87 | 0.0 | 833.00 | 2009.0 | 5000.00 | 1684259.0 |
| previous_payment_jul | 30000.0 | 5225.68 | 17606.96 | 0.0 | 390.00 | 1800.0 | 4505.00 | 896040.0 |
| previous_payment_jun | 30000.0 | 4826.08 | 15666.16 | 0.0 | 296.00 | 1500.0 | 4013.25 | 621000.0 |
| previous_payment_may | 30000.0 | 4799.39 | 15278.31 | 0.0 | 252.50 | 1500.0 | 4031.50 | 426529.0 |
| previous_payment_apr | 30000.0 | 5215.50 | 17777.47 | 0.0 | 117.75 | 1500.0 | 4000.00 | 528666.0 |
| default_payment_next_month | 30000.0 | 0.22 | 0.42 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
1.2.2 Summary statistics for categorical variables:
df.describe(include= 'object').transpose()
| | count | unique | top | freq |
|---|---|---|---|---|
| sex | 29850 | 2 | Female | 18027 |
| education | 29850 | 4 | University | 13960 |
| marriage | 29850 | 3 | Single | 15891 |
| payment_status_sep | 30000 | 10 | Unknown | 17496 |
| payment_status_aug | 30000 | 10 | Unknown | 19512 |
| payment_status_jul | 30000 | 10 | Unknown | 19849 |
| payment_status_jun | 30000 | 10 | Unknown | 20803 |
| payment_status_may | 30000 | 9 | Unknown | 21493 |
| payment_status_apr | 30000 | 9 | Unknown | 21181 |
1.2.3 Age distribution, overall and by gender:
# note: sns.distplot is deprecated since seaborn 0.11; histplot and kdeplot are its successors
fig, ax = plt.subplots()
sns.distplot(df.loc[df.sex == 'Male', 'age'].dropna(),
             hist=False, color='green',
             kde_kws={'shade': True},
             ax=ax, label='Male')
sns.distplot(df.loc[df.sex == 'Female', 'age'].dropna(),
             hist=False, color='blue',
             kde_kws={'shade': True},
             ax=ax, label='Female')
ax.set_title('Age distribution')
ax.legend(title='Gender:')
plt.tight_layout()
plt.show()
ax = sns.distplot(df.age.dropna())
ax.set_title('Age distribution');
Comment:
We noticed spikes roughly every 10 years, which are an artifact of the binning. Below we recreate the same histogram with sns.countplot and plotly_express. This way each age value gets its own bar and we can inspect the plot in detail. The following plots show no such spikes:
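Why binning can produce such spikes is easy to reproduce: when integer ages are sorted into equal-width bins whose width is not a whole number, some bins cover two integer values and others only one. The sketch below uses hypothetical data (every age from 21 to 79 exactly once) and an arbitrarily chosen bin count of 50; it is an illustration of the effect, not a reconstruction of the bins seaborn picked above.

```python
import numpy as np

# hypothetical data: every integer age from 21 to 79 occurs exactly once
ages = np.arange(21, 80)

# 50 equal-width bins over a range of 58 years: each bin is 1.16 years wide
counts, edges = np.histogram(ages, bins=50)

# one bar per distinct age value, as sns.countplot would draw it
values, per_value = np.unique(ages, return_counts=True)

print('distinct bin heights in the histogram:', sorted(set(counts.tolist())))
print('distinct bar heights per age value:   ', sorted(set(per_value.tolist())))
```

Although the underlying data is perfectly flat, the histogram shows bars of different heights, while the per-value counts do not.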
1.2.4 Age distribution (histogram with sns.countplot and plotly_express):
plot_ = sns.countplot(x=df.age.dropna(), color='blue')
# show only every tenth age on the x-axis to keep the tick labels readable
for ind, label in enumerate(plot_.get_xticklabels()):
    if int(float(label.get_text())) % 10 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)
#px.histogram(df, x='age', title='Age distribution')
We can separate the genders by specifying the hue argument:
pair_plot = sns.pairplot(df[['sex', 'age', 'limit_bal', 'previous_payment_sep']], hue='sex')
pair_plot.fig.suptitle('Pairplot of selected variables', y=1.05);
1.2.6 Define a function for plotting the correlation heatmap:
def plot_correlation_matrix(corr_mat):
    '''
    Parameters
    ----------
    corr_mat : pd.DataFrame
        Correlation matrix of the features.
    '''
    # temporarily change style
    sns.set(style='white')
    # mask the upper triangle (np.bool is deprecated, so use the builtin bool)
    mask = np.zeros_like(corr_mat, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # set up the matplotlib figure
    fig, ax = plt.subplots()
    # set up custom diverging colormap
    cmap = sns.diverging_palette(240, 10, n=9, as_cmap=True)
    # plot the heatmap
    sns.heatmap(corr_mat, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5,
                cbar_kws={'shrink': .5}, ax=ax)
    ax.set_title('Correlation Matrix', fontsize=10)
    # change back to darkgrid style
    sns.set(style='darkgrid')
corr_mat = df.select_dtypes(include='number').corr()
plot_correlation_matrix(corr_mat)
plt.tight_layout()
plt.show()
Comment:
The plot shows that age is not noticeably correlated with any of the other features.
We can also check the correlation between the (numerical) features and the target variable:
df.select_dtypes(include='number').corr()[['default_payment_next_month']]
| | default_payment_next_month |
|---|---|
| limit_bal | -0.153520 |
| age | 0.014491 |
| bill_statement_sep | -0.019644 |
| bill_statement_aug | -0.014193 |
| bill_statement_jul | -0.014076 |
| bill_statement_jun | -0.010156 |
| bill_statement_may | -0.006760 |
| bill_statement_apr | -0.005372 |
| previous_payment_sep | -0.072929 |
| previous_payment_aug | -0.058579 |
| previous_payment_jul | -0.056250 |
| previous_payment_jun | -0.056827 |
| previous_payment_may | -0.055124 |
| previous_payment_apr | -0.053183 |
| default_payment_next_month | 1.000000 |
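To read the table more easily, the features can be ranked by the absolute size of their correlation with the target. The sketch below copies the values from the table above into a Series; keep in mind that the Pearson coefficient only measures linear association, so a small value does not prove that a feature is useless for a tree-based model.

```python
import pandas as pd

# correlations with default_payment_next_month, copied from the table above
corr_with_target = pd.Series({
    'limit_bal': -0.153520,
    'age': 0.014491,
    'bill_statement_sep': -0.019644,
    'bill_statement_aug': -0.014193,
    'bill_statement_jul': -0.014076,
    'bill_statement_jun': -0.010156,
    'bill_statement_may': -0.006760,
    'bill_statement_apr': -0.005372,
    'previous_payment_sep': -0.072929,
    'previous_payment_aug': -0.058579,
    'previous_payment_jul': -0.056250,
    'previous_payment_jun': -0.056827,
    'previous_payment_may': -0.055124,
    'previous_payment_apr': -0.053183,
})

# rank by strength of the association, ignoring the sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked.head(3))
```

In this ranking limit_bal comes first and the previous_payment columns follow, while the bill_statement columns and age are close to zero.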
1.2.7 Plot the distribution of the limit balance by gender and education level:
ax = sns.violinplot(x='education', y='limit_bal', hue='sex', split=True, data=df)
ax.set_title('Distribution of limit balance per education level',fontsize=16)
plt.tight_layout()
plt.show()
ax = sns.violinplot(x='education', y='limit_bal', hue='sex', data=df)
ax.set_title('Distribution of limit balance per education level',fontsize=16);
Comment:
The analysis revealed some patterns:
1.2.8 Examine the distribution of the target variable by gender and education level:
# pass the column as a keyword; newer seaborn versions no longer accept it positionally
ax = sns.countplot(x='default_payment_next_month', hue='sex', data=df)
ax.set_title('Distribution of the target variable', fontsize=16)
plt.tight_layout()
plt.show()
Comment:
- The analysis shows that the percentage of defaults is higher among male customers.
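A count plot shows absolute numbers, so a statement about percentages is easier to verify with a normalized cross-tabulation. The sketch below uses a small hypothetical frame rather than the credit card data; the same call on df would give the default rate per gender.

```python
import pandas as pd

# hypothetical toy data, not the credit card dataset
toy = pd.DataFrame({
    'sex': ['Male'] * 4 + ['Female'] * 6,
    'default_payment_next_month': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# normalize='index' turns each row (one gender) into shares that sum to 1
rates = pd.crosstab(toy['sex'], toy['default_payment_next_month'],
                    normalize='index')
print(rates)
```

The column labelled 1 then holds the default rate of each group, independent of how many customers the group contains.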
1.2.9 Examine the percentage of defaults by education level:
ax = df.groupby('education')['default_payment_next_month'] \
       .value_counts(normalize=True) \
       .unstack() \
       .plot(kind='barh', stacked=True)
ax.set_title('Percentage of defaults by education level',
             fontsize=16)
ax.legend(title='Default', bbox_to_anchor=(1,1))
plt.tight_layout()
plt.show()
Comment:
- Based on this analysis, defaults occur most often among customers with a "High school" education and least often among customers in the "Others" category.
1.3.1 Import the function from sklearn:
from sklearn.model_selection import train_test_split
1.3.2 Split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
1.3.3 Split the data into training and test sets without shuffling:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2, shuffle=False)
1.3.4 Split the data into training and test sets with stratification:
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
stratify=y,
random_state=42)
Comment:
20% test set, 80% training set.
"stratify=y": the random split preserves the class ratio of the target in both sets, which matters for imbalanced data.
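The effect of stratification can be checked on a small synthetic target. The sketch below assumes a hypothetical array with 22% positives, roughly the default rate of this dataset, and compares the class share in the two resulting sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced target: 220 positives out of 1000 samples
y_toy = np.array([1] * 220 + [0] * 780)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(X_toy, y_toy,
                                    test_size=0.2,
                                    stratify=y_toy,
                                    random_state=42)

print(f'share of positives in train: {y_tr.mean():.4f}')
print(f'share of positives in test:  {y_te.mean():.4f}')
```

Without stratify=y the two shares would still be close on average, but they could drift apart by chance, especially for small test sets.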
1.3.5 Verify that the ratio of the target is preserved:
y_train.value_counts(normalize=True)
0    0.778792
1    0.221208
Name: default_payment_next_month, dtype: float64
y_test.value_counts(normalize=True)
0    0.778833
1    0.221167
Name: default_payment_next_month, dtype: float64
Comment: in both samples the default rate is about 22.12%.
# define the size of the validation and test sets
VALID_SIZE = 0.1
TEST_SIZE = 0.2
# create the initial split - training and temp
X_train, X_temp, y_train, y_temp = train_test_split(X, y,
test_size=(VALID_SIZE + TEST_SIZE),
stratify=y,
random_state=42)
# calculate the new test size
NEW_TEST_SIZE = np.around(TEST_SIZE / (VALID_SIZE + TEST_SIZE), 2)
# create the valid and test sets
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp,
test_size=NEW_TEST_SIZE,
stratify=y_temp,
random_state=42)
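The sizes that result from this two-step split can be checked on a stand-in array with the same number of rows. The sketch below leaves out stratification because only the row counts matter here. The 20999 training rows reported by X_train.info() later in the notebook are consistent with this split; a plausible reason for 20999 rather than 21000 is that 0.1 + 0.2 is slightly larger than 0.3 in floating point, so the size of the held-out part is rounded up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

VALID_SIZE = 0.1
TEST_SIZE = 0.2
NEW_TEST_SIZE = np.around(TEST_SIZE / (VALID_SIZE + TEST_SIZE), 2)

# stand-in for the 30000-row dataset; only the row count matters
rows = np.arange(30000)

train, temp = train_test_split(rows, test_size=VALID_SIZE + TEST_SIZE,
                               random_state=42)
valid, test = train_test_split(temp, test_size=NEW_TEST_SIZE,
                               random_state=42)

for name, part in [('train', train), ('valid', valid), ('test', test)]:
    print(f'{name}: {len(part)} rows ({len(part) / len(rows):.2%})')
```

Because NEW_TEST_SIZE is rounded to two decimals, the final proportions are close to 70/10/20 but not exact.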
1.5.1 Import the libraries:
import pandas as pd
import missingno # conda install -c conda-forge missingno
from sklearn.impute import SimpleImputer
1.5.2 Inspect the information about the DataFrame:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   limit_bal             30000 non-null  int64
 1   sex                   29850 non-null  object
 2   education             29850 non-null  object
 3   marriage              29850 non-null  object
 4   age                   29850 non-null  float64
 5   payment_status_sep    30000 non-null  object
 6   payment_status_aug    30000 non-null  object
 7   payment_status_jul    30000 non-null  object
 8   payment_status_jun    30000 non-null  object
 9   payment_status_may    30000 non-null  object
 10  payment_status_apr    30000 non-null  object
 11  bill_statement_sep    30000 non-null  int64
 12  bill_statement_aug    30000 non-null  int64
 13  bill_statement_jul    30000 non-null  int64
 14  bill_statement_jun    30000 non-null  int64
 15  bill_statement_may    30000 non-null  int64
 16  bill_statement_apr    30000 non-null  int64
 17  previous_payment_sep  30000 non-null  int64
 18  previous_payment_aug  30000 non-null  int64
 19  previous_payment_jul  30000 non-null  int64
 20  previous_payment_jun  30000 non-null  int64
 21  previous_payment_may  30000 non-null  int64
 22  previous_payment_apr  30000 non-null  int64
dtypes: float64(1), int64(13), object(9)
memory usage: 6.5+ MB
1.5.3 Visualize the nullity of the DataFrame:
missingno.matrix(X)
plt.show()
Comment: The white gaps in the bars of the data columns mark missing values, showing which of the 23 columns and which rows are affected.
1.5.4. Define columns with missing values per data type:
NUM_FEATURES = ['age']
CAT_FEATURES = ['sex', 'education', 'marriage']
1.5.5. Impute the numerical feature:
# work on explicit copies so the assignments below do not trigger pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
for col in NUM_FEATURES:
    num_imputer = SimpleImputer(strategy='median')
    num_imputer.fit(X_train[[col]])
    X_train.loc[:, col] = num_imputer.transform(X_train[[col]])
    X_test.loc[:, col] = num_imputer.transform(X_test[[col]])
We use SimpleImputer(strategy='median') to fill the missing values in the ['age'] column. The median is learned from the training set only and then applied to both sets.
1.5.6. Impute the categorical features:
for col in CAT_FEATURES:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    cat_imputer.fit(X_train[[col]])
    X_train.loc[:, col] = cat_imputer.transform(X_train[[col]])
    X_test.loc[:, col] = cat_imputer.transform(X_test[[col]])
We use SimpleImputer(strategy='most_frequent') to fill the missing values in the ['sex', 'education', 'marriage'] columns, again fitting on the training set only.
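The fit-on-train, transform-on-test pattern can be seen on a few hand-made values. The arrays below are hypothetical; the point is that the fill values (the median and the most frequent label) come from the training data, so nothing about the test set leaks into the preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical toy data: one numerical and one categorical column
age_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
age_test = np.array([[np.nan], [50.0]])
sex_train = np.array([['Female'], ['Female'], ['Male'], [np.nan]], dtype=object)
sex_test = np.array([[np.nan], ['Male']], dtype=object)

# learn the fill values on the training data only
num_imputer = SimpleImputer(strategy='median').fit(age_train)
cat_imputer = SimpleImputer(strategy='most_frequent').fit(sex_train)

# apply the learned fill values to the test data
age_filled = num_imputer.transform(age_test).ravel()
sex_filled = cat_imputer.transform(sex_test).ravel()

print(age_filled)  # the missing age becomes the training median, 30.0
print(sex_filled)  # the missing label becomes the most frequent training label
```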
1.5.7. Verify that there are no missing values:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20999 entries, 25553 to 27126
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   limit_bal             20999 non-null  int64
 1   sex                   20999 non-null  object
 2   education             20999 non-null  object
 3   marriage              20999 non-null  object
 4   age                   20999 non-null  float64
 5   payment_status_sep    20999 non-null  object
 6   payment_status_aug    20999 non-null  object
 7   payment_status_jul    20999 non-null  object
 8   payment_status_jun    20999 non-null  object
 9   payment_status_may    20999 non-null  object
 10  payment_status_apr    20999 non-null  object
 11  bill_statement_sep    20999 non-null  int64
 12  bill_statement_aug    20999 non-null  int64
 13  bill_statement_jul    20999 non-null  int64
 14  bill_statement_jun    20999 non-null  int64
 15  bill_statement_may    20999 non-null  int64
 16  bill_statement_apr    20999 non-null  int64
 17  previous_payment_sep  20999 non-null  int64
 18  previous_payment_aug  20999 non-null  int64
 19  previous_payment_jul  20999 non-null  int64
 20  previous_payment_jun  20999 non-null  int64
 21  previous_payment_may  20999 non-null  int64
 22  previous_payment_apr  20999 non-null  int64
dtypes: float64(1), int64(13), object(9)
memory usage: 3.8+ MB
1.6.1. Import the libraries:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
CAT_FEATURES = X_train.select_dtypes(include='object') \
.columns \
.to_list()
# note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
one_hot_encoder = OneHotEncoder(sparse=False,
                                handle_unknown='error',
                                drop='first')
one_hot_transformer = ColumnTransformer(
[("one_hot", one_hot_encoder, CAT_FEATURES)]
#,remainder='passthrough'
)
one_hot_transformer.fit(X_train)
ColumnTransformer(transformers=[('one_hot',
OneHotEncoder(drop='first', sparse=False),
['sex', 'education', 'marriage',
'payment_status_sep', 'payment_status_aug',
'payment_status_jul', 'payment_status_jun',
'payment_status_may',
'payment_status_apr'])])
# note: scikit-learn >= 1.2 replaces get_feature_names() with get_feature_names_out()
col_names = one_hot_transformer.get_feature_names()
X_train_cat = pd.DataFrame(one_hot_transformer.transform(X_train),
columns=col_names,
index=X_train.index)
X_train_ohe = pd.concat([X_train, X_train_cat], axis=1) \
.drop(CAT_FEATURES, axis=1)
X_test_cat = pd.DataFrame(one_hot_transformer.transform(X_test),
columns=col_names,
index=X_test.index)
X_test_ohe = pd.concat([X_test, X_test_cat], axis=1) \
.drop(CAT_FEATURES, axis=1)
X_train_cat
| | one_hot__x0_Male | one_hot__x1_High school | one_hot__x1_Others | one_hot__x1_University | one_hot__x2_Others | one_hot__x2_Single | one_hot__x3_Payment delayed 1 month | one_hot__x3_Payment delayed 2 months | one_hot__x3_Payment delayed 3 months | one_hot__x3_Payment delayed 4 months | ... | one_hot__x7_Payment delayed 8 months | one_hot__x7_Unknown | one_hot__x8_Payment delayed 2 months | one_hot__x8_Payment delayed 3 months | one_hot__x8_Payment delayed 4 months | one_hot__x8_Payment delayed 5 months | one_hot__x8_Payment delayed 6 months | one_hot__x8_Payment delayed 7 months | one_hot__x8_Payment delayed 8 months | one_hot__x8_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25553 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 14463 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 27267 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18620 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 12238 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25780 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 13921 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3794 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 27565 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 27126 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
20999 rows × 58 columns
X_train_ohe
| | limit_bal | age | bill_statement_sep | bill_statement_aug | bill_statement_jul | bill_statement_jun | bill_statement_may | bill_statement_apr | previous_payment_sep | previous_payment_aug | ... | one_hot__x7_Payment delayed 8 months | one_hot__x7_Unknown | one_hot__x8_Payment delayed 2 months | one_hot__x8_Payment delayed 3 months | one_hot__x8_Payment delayed 4 months | one_hot__x8_Payment delayed 5 months | one_hot__x8_Payment delayed 6 months | one_hot__x8_Payment delayed 7 months | one_hot__x8_Payment delayed 8 months | one_hot__x8_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25553 | 320000 | 37.0 | 202442 | 187475 | 164694 | 148160 | 132230 | 121191 | 8211 | 6100 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 14463 | 500000 | 35.0 | 1369 | 6138 | 20424 | 7840 | 846 | 790 | 4769 | 19629 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 27267 | 160000 | 42.0 | 14137 | 13613 | 14634 | 16532 | 15969 | 17701 | 0 | 1247 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18620 | 20000 | 26.0 | 1000 | 8930 | 0 | 0 | 0 | 790 | 8930 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 12238 | 20000 | 22.0 | 11999 | 3617 | 4165 | 6323 | 0 | 0 | 1062 | 1000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25780 | 200000 | 32.0 | 10701 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 13921 | 120000 | 24.0 | 112336 | 113351 | 115515 | 113948 | 122127 | 121962 | 4200 | 4100 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3794 | 120000 | 24.0 | 75796 | 76004 | 67187 | 49924 | 33188 | 19826 | 3700 | 2023 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 27565 | 360000 | 57.0 | 0 | 0 | 860 | 246 | -46 | -46 | 0 | 860 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 27126 | 300000 | 35.0 | 1246 | 1217 | 338 | 0 | 0 | 0 | 1217 | 338 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
20999 rows × 72 columns
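The column counts of the two tables above (58 and 72) can be derived by hand. With drop='first', each categorical column with k distinct values contributes k - 1 dummy columns. The sketch below takes the numbers of distinct values from the describe() table in section 1.2.2, which was computed on the full dataset; it assumes that the training split contains the same categories.

```python
# number of distinct categories per column, from the describe() table in 1.2.2
n_unique = {
    'sex': 2,
    'education': 4,
    'marriage': 3,
    'payment_status_sep': 10,
    'payment_status_aug': 10,
    'payment_status_jul': 10,
    'payment_status_jun': 10,
    'payment_status_may': 9,
    'payment_status_apr': 9,
}

# numerical feature columns reported by X.info(): 13 int64 + 1 float64
n_numeric = 14

# drop='first' removes one level per categorical column
n_dummy = sum(k - 1 for k in n_unique.values())

print(n_dummy)              # 58 one-hot columns
print(n_numeric + n_dummy)  # 72 columns after concatenation
```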
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
from performance import performance_evaluation_report
from io import StringIO
import seaborn as sns
from ipywidgets import Image
import pydotplus # conda install -c conda-forge pydotplus
tree_classifier = DecisionTreeClassifier(random_state=42)
tree_classifier.fit(X_train_ohe, y_train)
y_pred = tree_classifier.predict(X_test_ohe)
LABELS = ['No Default', 'Default']
tree_perf = performance_evaluation_report(tree_classifier,
X_test_ohe,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve = True)
plt.tight_layout()
plt.show()
Evaluation of our algorithm and model
tree_perf
{'accuracy': 0.7154700712982922,
'precision': 0.36736111111111114,
'recall': 0.39655172413793105,
'specificity': 0.8060464126037896,
'f1_score': 0.3813987022350397,
'cohens_kappa': 0.19699436859223496,
'roc_auc': 0.6016888830441072,
'pr_auc': 0.44881290572571897}
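The reported metrics are linked to each other, which allows a quick sanity check. The F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values in the dictionary above:

```python
# precision and recall copied from the tree_perf dictionary above
precision = 0.36736111111111114
recall = 0.39655172413793105

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.3814, matching the reported f1_score
```

An accuracy of about 0.72 next to a recall of about 0.40 is typical for an imbalanced target: most correct predictions come from the majority class (no default), while fewer than half of the actual defaults are found.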
small_tree = DecisionTreeClassifier(max_depth=4,
random_state=42)
small_tree.fit(X_train_ohe, y_train)
tree_dot = StringIO()
export_graphviz(small_tree, feature_names=X_train_ohe.columns,
class_names=LABELS, rounded=True, out_file=tree_dot,
proportion=False, precision=2, filled=True)
tree_graph = pydotplus.graph_from_dot_data(tree_dot.getvalue())
tree_graph.set_dpi(300)
Image(value=tree_graph.create_png())
y_pred_prob = tree_classifier.predict_proba(X_test_ohe)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test,
y_pred_prob)
ax = plt.subplot()
ax.plot(recall, precision,
label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve',
xlabel='Recall',
ylabel='Precision')
ax.legend()
plt.tight_layout()
plt.show()
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from performance import performance_evaluation_report
df = pd.read_csv('credit_card_default.csv',
index_col=0, na_values='')
X = df.copy()
y = X.pop('default_payment_next_month')
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
stratify=y,
random_state=42)
num_features = X_train.select_dtypes(include='number') \
.columns \
.to_list()
cat_features = X_train.select_dtypes(include='object') \
.columns \
.to_list()
num_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median'))
])
cat_list = [list(X_train[col].dropna().unique()) for col in cat_features]
cat_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(categories=cat_list, sparse=False,
handle_unknown='error', drop='first'))
])
preprocessor = ColumnTransformer(transformers=[
('numerical', num_pipeline, num_features),
('categorical', cat_pipeline, cat_features)],
remainder='drop')
dec_tree = DecisionTreeClassifier(random_state=42)
tree_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', dec_tree)])
tree_pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('numerical',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['limit_bal', 'age',
'bill_statement_sep',
'bill_statement_aug',
'bill_statement_jul',
'bill_statement_jun',
'bill_statement_may',
'bill_statement_apr',
'previous_payment_sep',
'previous_payment_aug',
'previous_payment_j...
'3 '
'months',
'Payment '
'delayed '
'7 '
'months',
'Payment '
'delayed '
'5 '
'months',
'Payment '
'delayed '
'8 '
'months']],
drop='first',
sparse=False))]),
['sex', 'education',
'marriage',
'payment_status_sep',
'payment_status_aug',
'payment_status_jul',
'payment_status_jun',
'payment_status_may',
'payment_status_apr'])])),
('classifier', DecisionTreeClassifier(random_state=42))])
LABELS = ['No Default', 'Default']
tree_perf = performance_evaluation_report(tree_pipeline, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve = True)
plt.tight_layout()
plt.show()
Evaluation of our algorithm and model
tree_perf
{'accuracy': 0.7233333333333334,
'precision': 0.3816631130063966,
'recall': 0.4046721929163527,
'specificity': 0.8138240958698909,
'f1_score': 0.3928310168251646,
'cohens_kappa': 0.21388003714653614,
'roc_auc': 0.6095279347712678,
'pr_auc': 0.4589251149708047}
from sklearn.base import BaseEstimator, TransformerMixin
class OutlierRemover(BaseEstimator, TransformerMixin):
    # caps values outside mean +/- n_std standard deviations; no rows are dropped
    def __init__(self, n_std=3):
        self.n_std = n_std

    def fit(self, X, y=None):
        if np.isnan(X).any(axis=None):
            raise ValueError('''There are missing values in the array!
                                Please remove them.''')
        # learn the per-feature bands from the data passed to fit
        mean_vec = np.mean(X, axis=0)
        std_vec = np.std(X, axis=0)
        self.upper_band_ = mean_vec + self.n_std * std_vec
        self.lower_band_ = mean_vec - self.n_std * std_vec
        self.n_features_ = len(self.upper_band_)
        return self

    def transform(self, X, y=None):
        X_copy = pd.DataFrame(X.copy())
        # repeat the bands so that they have the same shape as X
        upper_band = np.repeat(
            self.upper_band_.reshape(self.n_features_, -1),
            len(X_copy),
            axis=1).transpose()
        lower_band = np.repeat(
            self.lower_band_.reshape(self.n_features_, -1),
            len(X_copy),
            axis=1).transpose()
        # replace values beyond a band with the band value
        X_copy[X_copy >= upper_band] = upper_band
        X_copy[X_copy <= lower_band] = lower_band
        return X_copy.values
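Despite its name, the transformer above does not remove rows; it caps extreme values at the bands learned in fit. The same effect can be illustrated with np.clip, which is used here as a compact stand-in for the masking logic in transform, not as the notebook's implementation. The toy array and n_std=1 are assumptions chosen so that the tiny example shows an effect (with only five points, no value can lie more than 3 standard deviations from the mean).

```python
import numpy as np

# hypothetical single-feature data with one extreme value
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
n_std = 1  # smaller than the class default of 3, for illustration only

# per-feature bands, computed the same way as in OutlierRemover.fit
mean_vec = X_toy.mean(axis=0)
std_vec = X_toy.std(axis=0)
upper = mean_vec + n_std * std_vec
lower = mean_vec - n_std * std_vec

# cap values at the bands; the number of rows is unchanged
capped = np.clip(X_toy, lower, upper)

print('bands: ', lower, upper)
print('capped:', capped.ravel())
```

Because the bands are learned in fit, placing the transformer inside the pipeline ensures that the test data is capped with bands computed on the training data.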
num_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('outliers', OutlierRemover())
])
preprocessor = ColumnTransformer(transformers=[
('numerical', num_pipeline, num_features),
('categorical', cat_pipeline, cat_features)],
remainder='drop')
dec_tree = DecisionTreeClassifier(random_state=42)
tree_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', dec_tree)])
tree_pipeline.fit(X_train, y_train)
tree_perf = performance_evaluation_report(tree_pipeline, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve = True)
plt.tight_layout()
plt.show()
Evaluation of our algorithm and model
tree_perf
{'accuracy': 0.7203333333333334,
'precision': 0.377529658060014,
'recall': 0.4076865109269028,
'specificity': 0.8091161994436122,
'f1_score': 0.39202898550724635,
'cohens_kappa': 0.21077497538963086,
'roc_auc': 0.608982787005664,
'pr_auc': 0.4582134167657523}
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     RandomizedSearchCV, cross_validate,
                                     StratifiedKFold)
from sklearn import metrics
k_fold = StratifiedKFold(5, shuffle=True, random_state=42)
cross_val_score(tree_pipeline, X_train, y_train, cv=k_fold)
array([0.72333333, 0.72958333, 0.71375 , 0.723125 , 0.72 ])
cross_validate(tree_pipeline, X_train, y_train, cv=k_fold,
scoring=['accuracy', 'precision', 'recall',
'roc_auc'])
{'fit_time': array([0.40790272, 0.42912531, 0.41360974, 0.40620208, 0.4146595 ]),
'score_time': array([0.04928374, 0.05789733, 0.05372834, 0.0586493 , 0.0519464 ]),
'test_accuracy': array([0.72333333, 0.72958333, 0.71375 , 0.723125 , 0.72 ]),
'test_precision': array([0.38560411, 0.39557522, 0.36978297, 0.38638298, 0.37674825]),
'test_recall': array([0.42412818, 0.42090395, 0.41713748, 0.42749529, 0.40583804]),
'test_roc_auc': array([0.61633282, 0.61893804, 0.60794039, 0.61806569, 0.60712913])}
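The per-fold arrays returned by `cross_validate` are easier to compare once they are reduced to a mean and a standard deviation per metric. The sketch below shows one way to do that; it runs on synthetic data from `make_classification` (a stand-in, not the credit card dataset), and all names ending in `_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (hypothetical; not the credit card dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.8, 0.2], random_state=42)

k_fold_demo = StratifiedKFold(5, shuffle=True, random_state=42)
cv_results = cross_validate(DecisionTreeClassifier(random_state=42),
                            X_demo, y_demo, cv=k_fold_demo,
                            scoring=['accuracy', 'precision', 'recall', 'roc_auc'])

# Keep only the test_* columns and report mean and standard deviation over the folds.
cv_summary = (pd.DataFrame(cv_results)
              .filter(like='test_')
              .agg(['mean', 'std'])
              .T)
print(cv_summary)
```

The same three chained calls can be applied to the dictionary printed above by replacing the estimator and data with `tree_pipeline`, `X_train` and `y_train`.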
param_grid = {'classifier__criterion': ['entropy', 'gini'],
'classifier__max_depth': range(3, 11),
'classifier__min_samples_leaf': range(2, 11),
'preprocessor__numerical__outliers__n_std': [3, 4]}
classifier_gs = GridSearchCV(tree_pipeline, param_grid, scoring='recall',
cv=k_fold, n_jobs=-1, verbose=1)
classifier_gs.fit(X_train, y_train)
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('numerical',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median')),
('outliers',
OutlierRemover())]),
['limit_bal',
'age',
'bill_statement_sep',
'bill_statement_aug',
'bill_statement_jul',
'bill_statement...
'payment_status_jul',
'payment_status_jun',
'payment_status_may',
'payment_status_apr'])])),
('classifier',
DecisionTreeClassifier(random_state=42))]),
n_jobs=-1,
param_grid={'classifier__criterion': ['entropy', 'gini'],
'classifier__max_depth': range(3, 11),
'classifier__min_samples_leaf': range(2, 11),
'preprocessor__numerical__outliers__n_std': [3, 4]},
scoring='recall', verbose=1)
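The number of candidates and fits reported in the log can be reproduced from the grid itself. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the combinations; the grid is copied from the cell above, and `n_splits = 5` mirrors the `StratifiedKFold` setting.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV above.
param_grid = {'classifier__criterion': ['entropy', 'gini'],
              'classifier__max_depth': range(3, 11),
              'classifier__min_samples_leaf': range(2, 11),
              'preprocessor__numerical__outliers__n_std': [3, 4]}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 8 * 9 * 2 combinations
n_splits = 5                                   # folds used by k_fold
n_fits = n_candidates * n_splits

print(f'{n_candidates} candidates, {n_fits} fits')  # 288 candidates, 1440 fits
```

Running this before a search gives a rough idea of how long the search will take, since each fit trains the whole pipeline once.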
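print(f'Best parameters: {classifier_gs.best_params_}')
# best_score_ is the mean cross-validated recall of the best candidate,
# computed on the held-out folds of the training set
print(f'Recall (Training set): {classifier_gs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, classifier_gs.predict(X_test)):.4f}')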
Best parameters: {'classifier__criterion': 'gini', 'classifier__max_depth': 10, 'classifier__min_samples_leaf': 7, 'preprocessor__numerical__outliers__n_std': 4}
Recall (Training set): 0.3858
Recall (Test set): 0.3775
LABELS = ['No Default', 'Default']
tree_gs_perf = performance_evaluation_report(classifier_gs, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve = True)
plt.tight_layout()
plt.show()
Evaluation of our algorithm and model
tree_gs_perf
{'accuracy': 0.8031666666666667,
'precision': 0.5852803738317757,
'recall': 0.3775433308214017,
'specificity': 0.9240316713032314,
'f1_score': 0.45900137425561155,
'cohens_kappa': 0.34547526291831954,
'roc_auc': 0.7205480311384921,
'pr_auc': 0.4698784893189859}
classifier_rs = RandomizedSearchCV(tree_pipeline, param_grid, scoring='recall',
cv=k_fold, n_jobs=-1, verbose=1,
n_iter=100, random_state=42)
classifier_rs.fit(X_train, y_train)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
RandomizedSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
estimator=Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('numerical',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median')),
('outliers',
OutlierRemover())]),
['limit_bal',
'age',
'bill_statement_sep',
'bill_statement_aug',
'bill_statement_jul',
'bill_sta...
'payment_status_may',
'payment_status_apr'])])),
('classifier',
DecisionTreeClassifier(random_state=42))]),
n_iter=100, n_jobs=-1,
param_distributions={'classifier__criterion': ['entropy',
'gini'],
'classifier__max_depth': range(3, 11),
'classifier__min_samples_leaf': range(2, 11),
'preprocessor__numerical__outliers__n_std': [3,
4]},
random_state=42, scoring='recall', verbose=1)
print(f'Best parameters: {classifier_rs.best_params_}')
print(f'Recall (Training set): {classifier_rs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, classifier_rs.predict(X_test)):.4f}')
Best parameters: {'preprocessor__numerical__outliers__n_std': 3, 'classifier__min_samples_leaf': 7, 'classifier__max_depth': 10, 'classifier__criterion': 'gini'}
Recall (Training set): 0.3854
Recall (Test set): 0.3760
tree_rs_perf = performance_evaluation_report(classifier_rs, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve = True)
plt.tight_layout()
plt.show()
Evaluation of our algorithm and model
tree_rs_perf
{'accuracy': 0.8028333333333333,
'precision': 0.5843091334894613,
'recall': 0.3760361718161266,
'specificity': 0.9240316713032314,
'f1_score': 0.45758826226501603,
'cohens_kappa': 0.3439613201516819,
'roc_auc': 0.7188330048148133,
'pr_auc': 0.4686663337519278}
Prepare the Pipeline:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from performance import performance_evaluation_report
num_features = X_train.select_dtypes(include='number').columns.to_list()
cat_features = X_train.select_dtypes(include='object').columns.to_list()
num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
cat_list = [list(X_train[column].dropna().unique()) for column in cat_features]
# note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(categories=cat_list, sparse=False,
                             handle_unknown='error', drop='first'))
])
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_pipeline, num_features),
    ('categorical', cat_pipeline, cat_features)],
    remainder='drop')
rf = RandomForestClassifier(random_state=42)  # instantiate the classifier
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', rf)
])
rf_pipeline.fit(X_train, y_train)
LABELS = ['No Default', 'Default']
rf_perf = performance_evaluation_report(rf_pipeline, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
# plt.savefig('images/ch9_im1.png', dpi=300)
plt.show()
rf_perf
{'accuracy': 0.8126666666666666,
'precision': 0.6373477672530447,
'recall': 0.3549359457422758,
'specificity': 0.9426492617162422,
'f1_score': 0.4559535333978703,
'cohens_kappa': 0.3536945117892293,
'roc_auc': 0.7521730520421391,
'pr_auc': 0.5277687262649688}
gbt = GradientBoostingClassifier(random_state=42)
gbt_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', gbt)
])
gbt_pipeline.fit(X_train, y_train)
LABELS = ['No Default', 'Default']
gbt_perf = performance_evaluation_report(gbt_pipeline, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
# plt.savefig('images/ch9_im2.png', dpi=300)
plt.show()
gbt_perf
{'accuracy': 0.8153333333333334,
'precision': 0.664167916041979,
'recall': 0.333835719668425,
'specificity': 0.9520650545687995,
'f1_score': 0.4443329989969909,
'cohens_kappa': 0.3478377308833953,
'roc_auc': 0.7754929753263589,
'pr_auc': 0.5484953190087277}
Below we go over the most important hyperparameters of the models considered and show one possible way of tuning them with randomized search. These more complex models take significantly longer to train than the basic decision tree, so we need to balance the time spent on tuning against the expected improvement. Bear in mind that some hyperparameters (such as the learning rate or the number of estimators) themselves affect the training time.
To obtain results in a reasonable amount of time, we use randomized search with 100 different sets of hyperparameters for each model (the number of models actually fitted is higher because of cross-validation: with 5 folds, each search runs 500 fits). Just as in the recipe Grid Search and Cross-Validation, we use recall as the criterion for selecting the best model. Additionally, we use the scikit-learn compatible APIs of XGBoost and LightGBM to make the process as easy to follow as possible. For a complete list of hyperparameters and their meaning, please refer to the corresponding documentation.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn import metrics
import numpy as np
N_SEARCHES = 100
k_fold = StratifiedKFold(5, shuffle=True, random_state=42)
Random Forest
When tuning the Random Forest classifier, we look at the following hyperparameters (there are more available for tuning):
n_estimators - the number of decision trees in the forest.
max_features - the maximum number of features considered for splitting a node. The default is the square root of the number of features. When None, all features are considered.
max_depth - the maximum number of levels in each decision tree.
min_samples_split - the minimum number of observations required to split a node. When set too high, it may cause underfitting, as the trees will not split enough times.
min_samples_leaf - the minimum number of data points allowed in a leaf. Too small a value might cause overfitting, while large values might prevent the tree from growing and cause underfitting.
bootstrap - whether to use bootstrapping for each tree in the forest.
We define the grid as follows:
rf_param_grid = {'classifier__n_estimators': np.linspace(100, 1000, 10, dtype=int),
'classifier__max_features': ['log2', 'sqrt', None],
'classifier__max_depth': np.arange(3, 11, 1, dtype=int),
'classifier__min_samples_split': [2, 5, 10],
'classifier__min_samples_leaf': np.arange(1, 51, 2, dtype=int),
'classifier__bootstrap': [True, False]}
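To see why an exhaustive search is impractical for this grid, it helps to count its combinations and compare them with the 100 settings a randomized search evaluates. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on a copy of the grid above; the sampler is shown only to illustrate the sampling step, and whether it yields exactly the candidates of a later `RandomizedSearchCV` run is an assumption not verified here.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Copy of the Random Forest grid defined above.
rf_param_grid = {'classifier__n_estimators': np.linspace(100, 1000, 10, dtype=int),
                 'classifier__max_features': ['log2', 'sqrt', None],
                 'classifier__max_depth': np.arange(3, 11, 1, dtype=int),
                 'classifier__min_samples_split': [2, 5, 10],
                 'classifier__min_samples_leaf': np.arange(1, 51, 2, dtype=int),
                 'classifier__bootstrap': [True, False]}

# Size of the full grid: 10 * 3 * 8 * 3 * 25 * 2 combinations.
n_total = len(ParameterGrid(rf_param_grid))

# Draw 100 settings without replacement, as a randomized search over lists does.
sampled = list(ParameterSampler(rf_param_grid, n_iter=100, random_state=42))

print(f'Full grid: {n_total} combinations')
print(f'Sampled:   {len(sampled)} ({len(sampled) / n_total:.2%} of the grid)')
```

With 5-fold cross-validation, the full grid would need 180,000 fits, against 500 for the randomized search.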
And use the randomized search to tune the classifier:
rf_rs = RandomizedSearchCV(rf_pipeline, rf_param_grid, scoring='recall',
cv=k_fold, n_jobs=-1, verbose=1,
n_iter=N_SEARCHES, random_state=42)
rf_rs.fit(X_train, y_train)
print(f'Best parameters: {rf_rs.best_params_}')
print(f'Recall (Training set): {rf_rs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, rf_rs.predict(X_test)):.4f}')
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'classifier__n_estimators': 1000, 'classifier__min_samples_split': 2, 'classifier__min_samples_leaf': 11, 'classifier__max_features': None, 'classifier__max_depth': 9, 'classifier__bootstrap': False}
Recall (Training set): 0.3750
Recall (Test set): 0.3451
rf_rs_perf = performance_evaluation_report(rf_rs, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
rf_rs_perf
{'accuracy': 0.8088333333333333,
'precision': 0.6222826086956522,
'recall': 0.34513941220798794,
'specificity': 0.9405093087952066,
'f1_score': 0.4440135724672807,
'cohens_kappa': 0.3398343312239752,
'roc_auc': 0.7321475596715472,
'pr_auc': 0.49177918264811576}
Gradient Boosted Trees
As Gradient Boosted Trees are also an ensemble method built on top of decision trees, many of the hyperparameters are the same as for the Random Forest. The new one is the learning rate, which scales the contribution of each newly added tree and thereby controls how quickly the model descends towards the minimum of the loss function. When tuning manually, we should consider this hyperparameter together with the number of estimators: a lower learning rate makes the learning slower, so it usually calls for more estimators, and increasing the number of estimators can increase the computation time significantly.
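The interplay between the learning rate and the number of estimators can be sketched with a hand-rolled boosting loop. The example below is a simplified illustration and not what `GradientBoostingClassifier` does internally: it assumes a regression task with squared loss, uses depth-1 trees as weak learners, and runs on made-up toy data; `boost_mse`, `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_mse(X, y, learning_rate, n_estimators):
    """Fit stumps to the residuals and return the training MSE after each round."""
    prediction = np.full(len(y), y.mean())
    mse_per_round = []
    for _ in range(n_estimators):
        residuals = y - prediction  # negative gradient of the squared loss
        stump = DecisionTreeRegressor(max_depth=1, random_state=42)
        stump.fit(X, residuals)
        # shrink the contribution of the new tree by the learning rate
        prediction = prediction + learning_rate * stump.predict(X)
        mse_per_round.append(np.mean((y - prediction) ** 2))
    return mse_per_round

# Toy regression data (hypothetical).
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

for lr in (0.05, 0.3):
    mse = boost_mse(X_toy, y_toy, learning_rate=lr, n_estimators=50)
    print(f'learning_rate={lr}: MSE after 10 rounds {mse[9]:.4f}, '
          f'after 50 rounds {mse[49]:.4f}')
```

Comparing the two printed lines shows how far each learning rate gets within the same number of rounds, which is the trade-off to keep in mind when both hyperparameters are in the search grid.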
We define the grid as follows:
gbt_param_grid = {'classifier__n_estimators': np.linspace(100, 1000, 10, dtype=int),
'classifier__learning_rate': np.arange(0.05, 0.31, 0.05),
'classifier__max_depth': np.arange(3, 11, 1, dtype=int),
'classifier__min_samples_split': np.linspace(0.1, 0.5, 12),
'classifier__min_samples_leaf': np.arange(1, 51, 2, dtype=int),
'classifier__max_features':['log2', 'sqrt', None]}
And run the randomized search:
gbt_rs = RandomizedSearchCV(gbt_pipeline, gbt_param_grid, scoring='recall',
cv=k_fold, n_jobs=-1, verbose=1,
n_iter=N_SEARCHES, random_state=42)
gbt_rs.fit(X_train, y_train)
print(f'Best parameters: {gbt_rs.best_params_}')
print(f'Recall (Training set): {gbt_rs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, gbt_rs.predict(X_test)):.4f}')
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'classifier__n_estimators': 1000, 'classifier__min_samples_split': 0.2090909090909091, 'classifier__min_samples_leaf': 15, 'classifier__max_features': None, 'classifier__max_depth': 6, 'classifier__learning_rate': 0.25}
Recall (Training set): 0.3731
Recall (Test set): 0.3497
gbt_rs_perf = performance_evaluation_report(gbt_rs, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
gbt_rs_perf
{'accuracy': 0.8048333333333333,
'precision': 0.6010362694300518,
'recall': 0.3496608892238131,
'specificity': 0.9340894500320993,
'f1_score': 0.4421152929966651,
'cohens_kappa': 0.33371973668937804,
'roc_auc': 0.7496943673117111,
'pr_auc': 0.5070296242054544}
Import the libraries:
from xgboost.sklearn import XGBClassifier # conda install -c conda-forge xgboost
from lightgbm import LGBMClassifier # conda install -c conda-forge lightgbm
xgb = XGBClassifier(random_state=42)
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', xgb)
])
xgb_pipeline.fit(X_train, y_train)
xgb_perf = performance_evaluation_report(xgb_pipeline, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
# plt.savefig('images/ch9_im3.png', dpi=300)
plt.show()
C:\Users\alfa\anaconda3\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1]. warnings.warn(label_encoder_deprecation_msg, UserWarning)
[15:35:55] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
lgbm = LGBMClassifier(random_state=42)
lgbm_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', lgbm)
])
lgbm_pipeline.fit(X_train, y_train)
lgbm_perf = performance_evaluation_report(lgbm_pipeline, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
# plt.savefig('images/ch9_im4.png', dpi=300)
plt.show()
XGBoost
The scikit-learn API of XGBoost makes sure that the hyperparameters are named similarly to their equivalents in other scikit-learn classifiers. For example, XGBoost's native eta hyperparameter is called learning_rate in the scikit-learn API.
The new hyperparameters we consider for this example are:
min_child_weight - indicates the minimum sum of weights of all observations required in a child. This hyperparameter is used for controlling overfitting. Cross-validation should be used for tuning.
colsample_bytree - indicates the fraction of columns to be randomly sampled for each tree.
We define the grid as:
xgb_param_grid = {'classifier__n_estimators': np.linspace(100, 1000, 10, dtype=int),
'classifier__learning_rate': np.arange(0.05, 0.31, 0.05),
'classifier__max_depth': np.arange(3, 11, 1, dtype=int),
'classifier__min_child_weight': np.arange(1, 8, 1, dtype=int),
'classifier__colsample_bytree': np.linspace(0.3, 1, 7)}
For defining ranges of restricted parameters (such as colsample_bytree, which cannot be higher than 1.0), it is better to use np.linspace rather than np.arange, because np.arange can behave inconsistently when the step is a floating-point number. For example, the last value might be 1.0000000002, which then causes an error while training the classifier.
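The difference between the two functions is easy to inspect directly. The sketch below prints a floating-point `np.arange` call next to the `np.linspace` call used for `colsample_bytree`; the `np.arange` bounds are chosen only for illustration, and its exact output depends on floating-point rounding, so no expected values are claimed for it.

```python
import numpy as np

# With a floating-point step, the number of elements is derived from
# (stop - start) / step, so rounding can change what the last value is.
arange_vals = np.arange(0.1, 0.4, 0.1)
print('arange:  ', arange_vals, 'length:', len(arange_vals))

# np.linspace takes the number of points instead of a step and
# includes both endpoints as given.
linspace_vals = np.linspace(0.3, 1, 7)
print('linspace:', linspace_vals, 'length:', len(linspace_vals))
print('All values <= 1.0:', bool((linspace_vals <= 1.0).all()))
```

Since `np.linspace` fixes the number of points up front, the size of the grid is known in advance and the upper bound is never exceeded.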
xgb_rs = RandomizedSearchCV(xgb_pipeline,       # estimator: the pipeline to tune
                            xgb_param_grid,     # param_distributions: the dict defined in the previous cell
                            scoring='recall',   # selection metric: scikit-learn's built-in recall scorer
                            cv=k_fold,          # cross-validation splitting strategy
                            n_jobs=-1,          # number of parallel jobs; -1 uses all processors
                            verbose=1,          # the higher the value, the more messages are printed
                            n_iter=N_SEARCHES,  # number of sampled parameter settings, here 100
                            random_state=42)
xgb_rs.fit(X_train, y_train)
# Useful attributes after fitting: cv_results_, best_estimator_, best_score_, best_params_, best_index_, scorer_
print(f'Best parameters: {xgb_rs.best_params_}')
print(f'Recall (Training set): {xgb_rs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, xgb_rs.predict(X_test)):.4f}')
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'classifier__n_estimators': 600, 'classifier__min_child_weight': 5, 'classifier__max_depth': 9, 'classifier__learning_rate': 0.3, 'classifier__colsample_bytree': 1.0}
Recall (Training set): 0.3826
Recall (Test set): 0.3723
xgb_rs_perf = performance_evaluation_report(xgb_rs, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
LightGBM
We tune the same hyperparameters as for XGBoost, except for min_child_weight, which is left out of this grid; tuning more of them is definitely possible and encouraged. The grid is defined as follows:
lgbm_param_grid = {'classifier__n_estimators': np.linspace(100, 1000, 10, dtype=int),
'classifier__learning_rate': np.arange(0.05, 0.31, 0.05),
'classifier__max_depth': np.arange(3, 11, 1, dtype=int),
'classifier__colsample_bytree': np.linspace(0.3, 1, 7)}
lgbm_rs = RandomizedSearchCV(lgbm_pipeline, lgbm_param_grid, scoring='recall',
cv=k_fold, n_jobs=-1, verbose=1,
n_iter=N_SEARCHES, random_state=42)
lgbm_rs.fit(X_train, y_train)
print(f'Best parameters: {lgbm_rs.best_params_}')
print(f'Recall (Training set): {lgbm_rs.best_score_:.4f}')
print(f'Recall (Test set): {metrics.recall_score(y_test, lgbm_rs.predict(X_test)):.4f}')
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'classifier__n_estimators': 900, 'classifier__max_depth': 5, 'classifier__learning_rate': 0.25, 'classifier__colsample_bytree': 0.8833333333333333}
Recall (Training set): 0.3758
Recall (Test set): 0.3723
lgbm_rs_perf = performance_evaluation_report(lgbm_rs, X_test,
y_test, labels=LABELS,
show_plot=True,
show_pr_curve=True)
Below we present a summary of all the classifiers we have considered in the last 3 recipes.
results_dict = {'decision_tree_baseline': tree_perf,
'random_forest': rf_perf,
'random_forest_rs': rf_rs_perf,
'gradient_boosted_trees': gbt_perf,
'gradient_boosted_trees_rs': gbt_rs_perf,
'xgboost': xgb_perf,
'xgboost_rs': xgb_rs_perf,
'light_gbm': lgbm_perf,
'light_gbm_rs': lgbm_rs_perf}
results_comparison = pd.DataFrame(results_dict).T
results_comparison
| | accuracy | precision | recall | specificity | f1_score | cohens_kappa | roc_auc | pr_auc |
|---|---|---|---|---|---|---|---|---|
| decision_tree_baseline | 0.720333 | 0.377530 | 0.407687 | 0.809116 | 0.392029 | 0.210775 | 0.608983 | 0.458213 |
| random_forest | 0.812667 | 0.637348 | 0.354936 | 0.942649 | 0.455954 | 0.353695 | 0.752173 | 0.527769 |
| random_forest_rs | 0.808833 | 0.622283 | 0.345139 | 0.940509 | 0.444014 | 0.339834 | 0.732148 | 0.491779 |
| gradient_boosted_trees | 0.815333 | 0.664168 | 0.333836 | 0.952065 | 0.444333 | 0.347838 | 0.775493 | 0.548495 |
| gradient_boosted_trees_rs | 0.804833 | 0.601036 | 0.349661 | 0.934089 | 0.442115 | 0.333720 | 0.749694 | 0.507030 |
| xgboost | 0.811667 | 0.631158 | 0.357197 | 0.940723 | 0.456208 | 0.352735 | 0.760476 | 0.527454 |
| xgboost_rs | 0.798833 | 0.569124 | 0.372268 | 0.919966 | 0.450114 | 0.333538 | 0.742334 | 0.497010 |
| light_gbm | 0.817000 | 0.656207 | 0.362472 | 0.946073 | 0.466990 | 0.367428 | 0.775582 | 0.548010 |
| light_gbm_rs | 0.798167 | 0.566514 | 0.372268 | 0.919110 | 0.449295 | 0.332151 | 0.741661 | 0.490796 |
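Since recall was the selection criterion throughout, it can help to rank the models by it. The sketch below rebuilds two columns of the summary table by hand (values copied from the rows above) and sorts them; in the notebook itself the same calls could be applied to `results_comparison` directly. The name `summary` is hypothetical.

```python
import pandas as pd

# Recall and F1 values copied from the summary table above.
summary = pd.DataFrame(
    {'recall': [0.407687, 0.354936, 0.345139, 0.333836, 0.349661,
                0.357197, 0.372268, 0.362472, 0.372268],
     'f1_score': [0.392029, 0.455954, 0.444014, 0.444333, 0.442115,
                  0.456208, 0.450114, 0.466990, 0.449295]},
    index=['decision_tree_baseline', 'random_forest', 'random_forest_rs',
           'gradient_boosted_trees', 'gradient_boosted_trees_rs',
           'xgboost', 'xgboost_rs', 'light_gbm', 'light_gbm_rs'])

by_recall = summary.sort_values('recall', ascending=False)
print(by_recall)
print('Best recall:', summary['recall'].idxmax())
print('Best F1:    ', summary['f1_score'].idxmax())
```

On these numbers the baseline decision tree has the highest recall, while the untuned LightGBM model has the highest F1 score, so the choice of model depends on which metric matters for the use case.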